The objective of this project is to generate a text caption for an image without any human intervention.
Que: What will be its exact business use?
Ans: It can power web search by image (generate a caption, then search the web with it), help a blind person understand what is present in their surroundings, and let a robot describe what it is seeing.
Que: Do you need a low-latency model?
Ans: Yes, we need a low-latency model: high latency would delay scene understanding for robots and blind users, and in image search low latency gives quick results.
Que: Do you make predictions in batch mode?
Ans: No, we do not make predictions in batch mode.
Que: Do you need a real-time model or a batch-mode model?
Ans: A real-time model, because we need text from the image in real time.
1. In this project, Image Captioning, we will develop a deep-learning model that generates text for an image based on the visual context present in that image.
2. This project can be used to extract text descriptions from huge image datasets.
3. It can also power image search: when a user uploads an image to the search engine, the captioner generates text from the image's visual context and performs a web search with the generated text.
4. It can also help robots understand what they are seeing.
5. For people who cannot see: if the model is embedded in a blind person's goggles, it can capture images at a fixed time interval or on the user's command, generate a caption from each captured image based on its visual context, and read it out to the user via text-to-speech.
1. The dataset used in this project is the very large MS-COCO-2017 image dataset.
2. The dataset contains three splits: train, val, and test.
3. The train set has 118K images, the test set has 40K images, and the val set has 5K images; train and val come with caption annotations (the test set has none).
4. The dataset was downloaded from kaggle.com.
1. The project involves both computer vision and natural language processing.
First we extract features from each image using a pre-trained model; this feature-extraction step runs only once, so it needs no metric.
2. After feature extraction, the extracted features and the tokenized captions are passed to the sequence-to-sequence model that will be trained.
3. During training we need a metric to monitor and evaluate the run. Since the model outputs tokenized captions, we use categorical cross-entropy loss.
4. Once the model is trained and reaches a minimal loss, we generate captions for images and compare them with the original captions. For this task we use the BLEU score from NLTK.
5. Finally, we compare the BLEU scores on the train and val sets to check that the model generalises.
6. Read more on BLEU score here.
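Since point 4 relies on BLEU, here is a minimal pure-Python sketch of sentence-level BLEU (clipped n-gram precision plus brevity penalty). It is illustrative only; in the project itself we would call `nltk.translate.bleu_score.sentence_bleu`, which this mirrors with uniform weights.

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence BLEU: geometric mean of clipped n-gram precisions x brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        # n-gram counts of the candidate, clipped by the max count in any reference
        cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        max_ref = Counter()
        for ref in references:
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            for g, c in ref_ngrams.items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        if clipped == 0:
            return 0.0  # no overlap at this order -> score is 0
        precisions.append(clipped / max(sum(cand_ngrams.values()), 1))
    # brevity penalty against the reference length closest to the candidate length
    ref_len = min((len(r) for r in references),
                  key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0 and a caption with no overlapping n-grams scores 0.0, which is why we compute it over both train and val predictions to compare generalisation.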
import os
import glob
import pickle as pkl
from tqdm import tqdm
import gc
import re
import json
!pip install contractions
import contractions
root_dir = "D:\\projects\\image-captioning"
annotations_dir = "D:\\MS-COCO-2017-dataset\\coco2017\\annotations\\captions*.json"
caption_files = glob.glob(annotations_dir)
caption_files
['D:\\MS-COCO-2017-dataset\\coco2017\\annotations\\captions_train2017.json', 'D:\\MS-COCO-2017-dataset\\coco2017\\annotations\\captions_val2017.json']
def load_raw_caption(path):
    with open(path, 'r') as f:
        raw_captions = json.load(f)
    return raw_captions
raw_train_captions = load_raw_caption(caption_files[0])
raw_val_captions = load_raw_caption(caption_files[1])
raw_val_captions.keys()
dict_keys(['info', 'licenses', 'images', 'annotations'])
raw_val_captions['annotations'][0]
{'image_id': 179765,
'id': 38,
'caption': 'A black Honda motorcycle parked in front of a garage.'}
raw_val_captions['images'][0]
{'license': 4,
'file_name': '000000397133.jpg',
'coco_url': 'http://images.cocodataset.org/val2017/000000397133.jpg',
'height': 427,
'width': 640,
'date_captured': '2013-11-14 17:02:52',
'flickr_url': 'http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg',
'id': 397133}
def extract_words(sentence):
    # expand contractions, e.g. "isn't" -> "is not"
    sentence = ' '.join(contractions.fix(word) for word in sentence.split())
    return sentence

def preprocess(sentence):
    # expand contractions BEFORE lowercasing and stripping punctuation;
    # once the regex removes apostrophes, contractions.fix can no longer match
    sentence = extract_words(sentence)
    sentence = sentence.lower()
    sentence = re.sub("[^a-zA-Z]", ' ', sentence)
    sentence = ' '.join(x for x in sentence.split() if x.isalpha() and len(x) > 1)
    return sentence
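To see why contraction expansion has to happen before the regex strips punctuation, here is a self-contained sketch that uses a tiny hand-rolled contraction map as a stand-in for the third-party `contractions` package (the map and function name are illustrative, not part of the project code):

```python
import re

# minimal stand-in for the contractions package, for illustration only
CONTRACTIONS = {"isn't": "is not", "aren't": "are not", "can't": "cannot"}

def demo_preprocess(sentence):
    # expand contractions first: once "[^a-zA-Z]" removes the apostrophe,
    # "aren't" becomes "aren t" and can no longer be expanded
    sentence = ' '.join(CONTRACTIONS.get(w.lower(), w) for w in sentence.split())
    sentence = re.sub("[^a-zA-Z]", ' ', sentence.lower())
    return ' '.join(x for x in sentence.split() if len(x) > 1)

print(demo_preprocess("Two dogs aren't playing!"))  # → two dogs are not playing
```

If the punctuation were stripped first, the output would contain the orphaned tokens "aren" and "t" instead of the expanded "are not".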
# An earlier version of this function kept only one caption per image;
# it is superseded by the version below, which keeps all five captions per image.
def preprocess_and_save_captions(raw_captions, name):
    print(f"Extracting and cleaning {name} captions...", end='')
    # map image id -> file name
    id_to_file_name = dict()
    for data in raw_captions['images']:
        id_to_file_name[data['id']] = data['file_name']
    # map image id -> list of all its raw captions
    id_to_captions = dict()
    for data in raw_captions['annotations']:
        id_to_captions.setdefault(data['image_id'], []).append(data['caption'])
    # clean every caption and wrap it in start/end tokens
    captions = dict()
    for ids, file_name in id_to_file_name.items():
        captions_list = []
        for cap in id_to_captions[ids]:
            captions_list.append('<sos> ' + preprocess(cap) + ' <eos>')
        captions[file_name] = captions_list
    print('done!', end='')
    print("saving...", end='')
    with open(os.path.join(root_dir, name + '_captions_list.pkl'), 'wb') as f:
        pkl.dump(captions, f)
    print("done!")
    return captions
train_captions = preprocess_and_save_captions(raw_train_captions, 'train')
val_captions = preprocess_and_save_captions(raw_val_captions, 'val')
Extracting and cleaning train captions...done!saving...done!
Extracting and cleaning val captions...done!saving...done!
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import cv2 as cv
images_root_dir = "D:\\MS-COCO-2017-dataset\\coco2017"
i = 1
for img_name, img_captions in val_captions.items():
    img_path = os.path.join(images_root_dir, 'val2017', img_name)
    for cap in img_captions:
        print(cap)
    img = cv.imread(img_path)
    img = cv.cvtColor(img, cv.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.show()
    print('')
    print('-' * 127)
    if i == 10:
        break
    i += 1
<sos> man is in kitchen making pizzas <eos>
<sos> man in apron standing on front of oven with pans and bakeware <eos>
<sos> baker is working in the kitchen rolling dough <eos>
<sos> person standing by stove in kitchen <eos>
<sos> table with pies being made and person standing near wall with pots and pans hanging on the wall <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> the dining table near the kitchen has bowl of fruit on it <eos>
<sos> small kitchen has various appliances and table <eos>
<sos> the kitchen is clean and ready for us to see <eos>
<sos> kitchen and dining area decorated in white <eos>
<sos> kitchen that has bowl of fruit on the table <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> person with shopping cart on city street <eos>
<sos> city dwellers walk by as homeless man begs for cash <eos>
<sos> people walking past homeless man begging on city street <eos>
<sos> homeless man holding cup and standing next to shopping cart on street <eos>
<sos> people are walking on the street by homeless person <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> person on skateboard and bike at skate park <eos>
<sos> man on skateboard performs trick at the skate park <eos>
<sos> skateboarder jumps into the air as he performs skateboard trick <eos>
<sos> athletes performing tricks on bmx bicycle and skateboard <eos>
<sos> man falls off his skateboard in skate park <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> blue bike parked on side walk <eos>
<sos> bicycle is chained to fixture on city street <eos>
<sos> blue bicycle sits on sidewalk near street <eos>
<sos> bicycle is locked up to post <eos>
<sos> bike sits parked next to street <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> bathroom that has broken wall in the shower <eos>
<sos> bathroom looks clean but is missing tile at the shower stall <eos>
<sos> view of bathroom that needs to be fixed up <eos>
<sos> shower toilet and sink in basement bathroom <eos>
<sos> very big whit rest room with shabby looking shower <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> couple of buckets in white room <eos>
<sos> bathroom with no toilets and red and green bucket <eos>
<sos> shower room with two buckets tolet paper holder and soap <eos>
<sos> standing toilet in bathroom next to window <eos>
<sos> this picture looks like janitors closet with buckets on the floor <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> the shiny motorcycle has been put on display <eos>
<sos> the new motorcycle on display is very shiny <eos>
<sos> motorcycle is parked inside of building <eos>
<sos> brand new motorcycle on display at show <eos>
<sos> the front end of red motorcycle that is on display <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> row of white toilets sitting on top of dirt ground <eos>
<sos> bunch of dirty looking white toilets in row outside <eos>
<sos> there is row of used and broken toilets in field <eos>
<sos> row of toilets with broken seat tops on the ground <eos>
<sos> lots of broken toilets sitting out on lawn <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
<sos> small closed toilet in cramped space <eos>
<sos> tan toilet and sink combination in small room <eos>
<sos> this is an advanced toilet with sink and control panel <eos>
<sos> close up picture of toilet with fountain <eos>
<sos> off white toilet with faucet and controls <eos>
[image displayed inline]
-------------------------------------------------------------------------------------------------------------------------------
train_sent_length, val_sent_length = [], []
for k, v in train_captions.items():
    for cap in v:
        train_sent_length.append(len(cap.split(' ')))
for k, v in val_captions.items():
    for cap in v:
        val_sent_length.append(len(cap.split(' ')))
train_sent_length[:5], val_sent_length[:5]
([12, 12, 9, 25, 15], [8, 14, 10, 8, 20])
len(train_sent_length), len(val_sent_length)
(591753, 25014)
sns.kdeplot(train_sent_length)
plt.title("pdf of sentences/captions length for train captions")
plt.xlabel('Count of words in a sentence')
plt.show()
sns.histplot(train_sent_length)
plt.title("Histogram of sentences/captions length for train captions")
plt.xlabel('Count of words in a sentence')
plt.ylabel('Count of sentences')
plt.show()
print(f"Percentile of length of sentences/captions between 0th-100th with gap of 10 for train captions")
print('-'*100)
for i in range(0, 101, 10):
    print(f"# {i}th percentile: {np.percentile(train_sent_length, i)}")
Percentile of length of sentences/captions between 0th-100th with gap of 10 for train captions
----------------------------------------------------------------------------------------------------
# 0th percentile: 3.0
# 10th percentile: 9.0
# 20th percentile: 9.0
# 30th percentile: 10.0
# 40th percentile: 10.0
# 50th percentile: 10.0
# 60th percentile: 11.0
# 70th percentile: 11.0
# 80th percentile: 12.0
# 90th percentile: 13.0
# 100th percentile: 49.0
print(f"Percentile of length of sentences/captions between 90th-100th with gap of 1 for train captions")
print('-'*100)
for i in range(90, 101):
    print(f"# {i}th percentile: {np.percentile(train_sent_length, i)}")
Percentile of length of sentences/captions between 90th-100th with gap of 1 for train captions
----------------------------------------------------------------------------------------------------
# 90th percentile: 13.0
# 91th percentile: 14.0
# 92th percentile: 14.0
# 93th percentile: 14.0
# 94th percentile: 14.0
# 95th percentile: 15.0
# 96th percentile: 15.0
# 97th percentile: 16.0
# 98th percentile: 17.0
# 99th percentile: 18.0
# 100th percentile: 49.0
print(f"Percentile of length of sentences/captions between 99th-100th with gap of 0.1 for train captions")
print('-'*100)
for i in [99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9, 100]:
    print(f"# {i}th percentile: {np.percentile(train_sent_length, i)}")
Percentile of length of sentences/captions between 99th-100th with gap of 0.1 for train captions
----------------------------------------------------------------------------------------------------
# 99.1th percentile: 19.0
# 99.2th percentile: 19.0
# 99.3th percentile: 20.0
# 99.4th percentile: 20.0
# 99.5th percentile: 21.0
# 99.6th percentile: 21.0
# 99.7th percentile: 22.0
# 99.8th percentile: 24.0
# 99.9th percentile: 27.0
# 100th percentile: 49.0
Observation: most train captions are 9-13 words long; even the 99.9th percentile is only 27 words while the maximum is 49, so padding/truncating to roughly 25-30 tokens would cover almost the entire train set.
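The percentile printouts above are typically used to pick a fixed sequence length for padding and truncation. A minimal sketch, using a short hypothetical list of caption lengths in place of `train_sent_length` (on the real train set, the 99th percentile shown above is 18 words):

```python
import numpy as np

# hypothetical caption lengths standing in for train_sent_length
lengths = [9, 10, 10, 10, 11, 12, 13, 14, 18, 49]

# pad/truncate to a high percentile instead of the max, so a handful of
# 40+ word outliers do not inflate the length of every padded sequence
max_len = int(np.ceil(np.percentile(lengths, 99)))
```

Choosing a percentile cutoff trades a tiny amount of truncated text for much shorter padded batches, which speeds up sequence-to-sequence training.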
sns.kdeplot(val_sent_length)
plt.title("pdf of sentences/captions length for val captions")
plt.xlabel('Count of words in a sentence')
plt.show()
sns.histplot(val_sent_length)
plt.title("Histogram of sentences/captions length for val captions")
plt.xlabel('Count of words in a sentence')
plt.ylabel('Count of sentences')
plt.show()
print(f"Percentile of length of sentences/captions between 0th-100th with gap of 10 for val captions")
print('-'*100)
for i in range(0, 101, 10):
    print(f"# {i}th percentile: {np.percentile(val_sent_length, i)}")
Percentile of length of sentences/captions between 0th-100th with gap of 10 for val captions
----------------------------------------------------------------------------------------------------
# 0th percentile: 7.0
# 10th percentile: 9.0
# 20th percentile: 9.0
# 30th percentile: 10.0
# 40th percentile: 10.0
# 50th percentile: 10.0
# 60th percentile: 11.0
# 70th percentile: 11.0
# 80th percentile: 12.0
# 90th percentile: 13.0
# 100th percentile: 47.0
print(f"Percentile of length of sentences/captions between 90th-100th with gap of 1 for val captions")
print('-'*100)
for i in range(90, 101):
    print(f"# {i}th percentile: {np.percentile(val_sent_length, i)}")
Percentile of length of sentences/captions between 90th-100th with gap of 1 for val captions
----------------------------------------------------------------------------------------------------
# 90th percentile: 13.0
# 91th percentile: 14.0
# 92th percentile: 14.0
# 93th percentile: 14.0
# 94th percentile: 14.0
# 95th percentile: 15.0
# 96th percentile: 15.0
# 97th percentile: 16.0
# 98th percentile: 16.0
# 99th percentile: 18.0
# 100th percentile: 47.0
print(f"Percentile of length of sentences/captions between 99th-100th with gap of 0.1 for val captions")
print('-'*100)
for i in [99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9, 100]:
    print(f"# {i}th percentile: {np.percentile(val_sent_length, i)}")
Percentile of length of sentences/captions between 99th-100th with gap of 0.1 for val captions
----------------------------------------------------------------------------------------------------
# 99.1th percentile: 18.0
# 99.2th percentile: 19.0
# 99.3th percentile: 19.0
# 99.4th percentile: 20.0
# 99.5th percentile: 20.0
# 99.6th percentile: 21.0
# 99.7th percentile: 22.0
# 99.8th percentile: 23.0
# 99.9th percentile: 26.0
# 100th percentile: 47.0
Observation: the val set follows essentially the same length distribution as the train set (median 10 words, 99.9th percentile 26, maximum 47), so the same maximum caption length works for both splits.
!pip install wordcloud
from wordcloud import WordCloud
def plot_word_cloud(captions, name):
    captions_text = ''
    for k, v in captions.items():
        for cap in v:
            # drop the <sos>/<eos> tokens and keep a trailing space so that
            # consecutive captions do not fuse into a single word
            captions_text += ' '.join(x for x in cap.split(' ')
                                      if x != '<eos>' and x != '<sos>') + ' '
    print(f"{len(captions_text.split())} words in the {name} captions.")
    print('-' * 100)
    word_cloud = WordCloud(background_color='white', width=1600, height=1000,
                           max_words=10000, collocations=False).generate(captions_text)
    plt.figure(figsize=(20, 20))
    plt.imshow(word_cloud)
    plt.axis('off')
    plt.show()
plot_word_cloud(train_captions, 'train')
4619529 words in the train captions.
----------------------------------------------------------------------------------------------------
plot_word_cloud(val_captions, 'val')
195017 words in the val captions.
----------------------------------------------------------------------------------------------------
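A word cloud is only qualitative; for exact frequencies, the same token stream can be counted with `collections.Counter`. A sketch using a hypothetical two-image dict that mirrors the structure of `val_captions`:

```python
from collections import Counter

# hypothetical captions dict mirroring the structure of val_captions
captions = {"img1.jpg": ["<sos> man riding horse <eos>"],
            "img2.jpg": ["<sos> man on skateboard <eos>"]}

# flatten all captions into one token list, dropping the special tokens
words = [w for caps in captions.values() for cap in caps
         for w in cap.split() if w not in ("<sos>", "<eos>")]
print(Counter(words).most_common(3))  # → [('man', 2), ('riding', 1), ('horse', 1)]
```

Run on the full train/val caption dicts, this gives the exact counts behind the word clouds and is also the natural starting point for building the tokenizer vocabulary.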
Observation
gc.collect()
63441